Introduction

This project examines data from Ken French's website, which provides the average firm size, monthly return, and beta for 49 industries, as well as the overall market return and risk-free rate. Our questions are twofold. First, do the returns of different industries behave similarly (do they have similar mean returns, are they correlated, do they have similar book-equity-to-market-equity ratios, etc.)? Second, does the Capital Asset Pricing Model (CAPM) accurately predict the cross-sectional returns of the various industries?

The CAPM predicts a portfolio's returns given the portfolio's market risk. We will test the CAPM using the method pioneered by Fama and MacBeth (1973): first calculating each industry's market beta, and then running cross-sectional regressions to see whether beta explains the average returns on each industry.

Data

library(readxl)               #read_excel() for the .xlsx returns file
library(reshape2)             #melt()
library(car)                  #recode(), qqPlot()
library(corrplot)             #cor.mtest(), corrplot.mixed()
library(PerformanceAnalytics) #chart.Correlation()
BE <- read.csv("Average BE .csv")
size <- read.csv("Average Firm Size Monthly.csv")
returns <- read_excel("Average Weighted Returns Monthly.xlsx")
rf <- read.csv("RP-RF.csv")

The variables we plan to use and create in this dataset are:

* Year
* Industry Percent Return - the percent return on investment for a given industry in a given year
* Industry - the 49 individual industries, as well as composite industries we will create later during the data cleaning process
* Average Firm Size - the average market capitalization of a firm in a given year, in millions of dollars
* Book Equity / Market Equity - the ratio of a firm's book value of equity to its market value of equity
* Market Return - defined as "the return on the overall theoretical market portfolio, which includes all assets and is weighted by value"
* Risk-Free Rate - defined as "the rate of return of a hypothetical investment with no risk of financial loss"
* Beta - defined as "a measure of a stock's volatility in relation to the market"

The formula for Beta is:

\(\beta_i = \frac{Cov(R_e, R_m)}{Var(R_m)}\)

where \(R_e\) = the return on an individual portfolio and \(R_m\) = the market return
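As a concrete sketch of this formula (illustrative only, using simulated series in place of the real portfolio and market returns loaded above), beta is the covariance-to-variance ratio, which is also the slope from regressing the portfolio's returns on the market's:

```r
# Illustrative only: simulated returns standing in for an industry
# portfolio (r_e) and the market (r_m).
set.seed(42)
r_m <- rnorm(120, mean = 0.8, sd = 4)   # market returns (%)
r_e <- 1.2 * r_m + rnorm(120, sd = 2)   # portfolio with true beta of about 1.2
beta <- cov(r_e, r_m) / var(r_m)
# Equivalently, beta is the OLS slope of r_e on r_m:
all.equal(beta, unname(coef(lm(r_e ~ r_m))[2]))
```

The regression equivalence is why beta is often described as the slope of the characteristic line.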

The data sets we collected were formatted in such a way that required extensive data cleaning before beginning our analysis.

First, we needed to resolve cases of incomplete data across all of our data sets.

Second, the "returns" and "size" data was provided at a monthly frequency from 1926 onward, and we needed to aggregate it to a yearly level (averaging the monthly sizes and compounding the monthly returns within each year).

Third, the four data sets, "BE" (book equity to market equity), "size" (average firm size), "returns" (percent returns for the given period), and "rf" (market and risk-free returns for the period), were organized in wide form, with years as rows, each industry (e.g. Agric, Food, Soda, ...) as a column, and the values in the cells. We needed to gather the industry columns into a single "industry" column, which required reshaping each data set from wide to long form.

Fourth, the "BE", "size", and "returns" data sets each contained 49 individual industry columns. On closer inspection, the industries were quite specific, so we aggregated them into larger categories which we call "Composite Industries" (e.g. Healthcare, Industrials, and TMT). This aggregation required averaging the book-to-market ratios, returns, and sizes of the industries within each Composite Industry.

Fifth, we created a new "beta" variable based on the formula above, computed for each individual industry over the full sample and then averaged within composite industries.

Sixth, once we formatted each of the four data sets in similar manners, we merged them into one data frame to be able to run tests and analyses on the data.

Data Cleaning

First, across all datasets, missing observations for firm size, percent return, or the BE/ME ratio are coded as -99.99. We will begin by replacing all of these values with NA.

BE[BE== -99.99] <- NA
size[size == -99.99] <- NA 
returns[returns == -99.99] <- NA 
rf[rf == -99.99] <- NA

Next, we will reformat the year variable to include only the year, as opposed to year-month, while computing yearly values for the "returns" and "size" dataframes. We also decided to cut off the last year in the date variables of the returns and size dataframes. To put all the dataframes on a common yearly basis, we average the monthly sizes and compound the monthly returns within each year.

size$Year <- substr(size$Year, 1, 4) #Strip off the month values 
size <- aggregate(.~Year, data = size, mean,  na.action = na.pass)
total_return <- function(returns_array){
  final_return <- 1
  for (individual_return in returns_array){
    if (!is.na(individual_return) && !is.na(final_return)){
      final_return <- final_return*(1+individual_return/100)
    }
    else{
      final_return <- NA
    }
  }
  if (!is.na(final_return)){
    final_return <- final_return - 1
  }
  return(final_return)
}
returns$Year <- substr(returns$Year, 1, 4) #Strip off the month values
returns <- aggregate(.~Year, data = returns, total_return, na.action = na.pass)
rf$Year <- substr(rf$Year,1,4)
rf <- aggregate(.~Year, data= rf, total_return,  na.action = na.pass)
rf$mktret <- rf$Mkt.RF+rf$RF #Market return: add the risk-free rate back to the excess market return (Mkt-RF)
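As a quick sanity check of the compounding logic (not part of the pipeline), total_return is equivalent to taking the product of the monthly growth factors, since prod() also propagates NA:

```r
# Compounding percent returns: the same arithmetic total_return() performs
# with its loop, written with prod().
compound <- function(r) prod(1 + r / 100) - 1
compound(c(10, -5))   # 1.10 * 0.95 - 1 = 0.045
compound(c(10, NA))   # NA propagates, as in total_return()
```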

#We want the data at each time click to be the data that would have been available to someone at that time. Because size and BE/ME are backwards-looking calculations but predictive variables of returns, they contain information from the future that would not have been available to someone trying to make a prediction. Therefore, we lag both of these variables by one year.
BE$Year <- as.integer(BE$Year) + 1
size$Year <- as.integer(size$Year) + 1
#Drop the last row so that we have same-sized datasets, and adjust for rf and returns not being lagged.
BE <- BE[1:88,]
size <- size[1:88,]
rf <- rf[2:89,]
returns <- returns[2:89,]

Before using our datasets, we need to convert them from wide form to long form. Using the melt function, each dataframe transforms from an 89x50 dataframe to a 4361x3 dataframe. We then adjust the column names, eliminate repeated columns, and repeat for the other dataframes (returns and size).

BE <- melt(BE, id = c("Year"))
BE$Ind <- BE$variable
BE$variable <- NULL
BE$BEMERatio <- BE$value
BE$value <- NULL
size <- melt(size, id = c("Year"))
size$size <- size$value
size$value <- NULL
size$Ind <- size$variable 
size$variable <- NULL 
returns <- melt(returns, id = c("Year"))
returns$returns <- returns$value
returns$value <- NULL 
returns$Ind <- returns$variable
returns$variable <- NULL

Once we have transformed the dataframes into the appropriate form, we can then merge the datasets together into the final, usable dataframe. Because year and industry are repeated so many times in each of the datasets, to merge, we must create a YearInd variable that represents an industry in a unique year. After merging, we delete the repeated columns and rename the remaining ones.

size$YearInd <- paste(size$Year, size$Ind, sep = "_")
BE$YearInd <- paste(BE$Year, BE$Ind, sep = "_")
returns$YearInd <- paste(returns$Year, returns$Ind, sep = "_")
df <- merge(size, BE, by = "YearInd")
df <- merge(df, returns, by = "YearInd")
df$YearInd <- NULL
df$Year.y <- NULL 
df$Ind.y <- NULL 
df$variable <- NULL
df$Year <- NULL
df$Ind.x <- NULL
names(df)[1] <- "Year"

To broaden our analysis, we create two new variables: composite industry (categorical) and Beta (continuous). Below is the code used to create them. We begin with the composite industry variable, which recodes our 49 individual industries into eight overarching industries (Healthcare, Industrials, TMT, Financials, Consumer, Utilities, Commodities, and Other).

df$CompInd <- recode(df$Ind, "'Hlth' = 'Healthcare'; 'MedEq' = 'Healthcare'; 'Drugs' = 'Healthcare'; 'Chems' = 'Healthcare'; 'LabEq' = 'Healthcare'; 'Rubbr' = 'Industrials'; 'BldMt' = 'Industrials'; 'Cnstr' = 'Industrials'; 'Mach' = 'Industrials'; 'ElcEq' = 'Industrials'; 'Autos' = 'Industrials'; 'Aero' = 'Industrials'; 'Ships' = 'Industrials'; 'Mines' = 'Industrials'; 'FabPr' = 'Industrials'; 'Fun' = 'TMT'; 'Telcm' = 'TMT'; 'PerSv' = 'TMT'; 'BusSv' = 'TMT'; 'Softw' = 'TMT'; 'Chips' = 'TMT'; 'Paper' = 'TMT';'Hardw' = 'TMT'; 'Banks' = 'Financials'; 'Insur' = 'Financials'; 'RlEst' = 'Financials'; 'Fin' = 'Financials'; 'Food' = 'Consumer'; 'Soda' = 'Consumer'; 'Beer' = 'Consumer'; 'Smoke' = 'Consumer'; 'Books' = 'Consumer'; 'Clths' = 'Consumer'; 'Hshld' = 'Consumer'; 'Meals' = 'Consumer'; 'Rtail' = 'Consumer'; 'Util' = 'Utilities'; 'Gold' = 'Commodities'; 'Coal' = 'Commodities'; 'Oil' = 'Commodities'; 'Steel' = 'Commodities'; 'Txtls' = 'Commodities'; 'Other' = 'Other'; 'Trans' = 'Other'; 'Boxes' = 'Other'; 'Guns' = 'Other'; 'Whlsl' = 'Other'; 'Agric' = 'Other'; 'Toys' = 'Other'")

Next, we create Beta for each industry: the covariance of the market returns and a given industry's returns, divided by the variance of the market returns.

df <- merge(df, rf, by = "Year")
df$Year <- as.factor(df$Year)
df$Beta <- NA
for(i in unique(df$Ind)){
  temp <- df[df$Ind == i,]
  a <- (cov(temp$returns,temp$mktret))/(var(temp$mktret))
  df$Beta[df$Ind==i] <- a
}

Here, we create all of the composite variables for our dataset such as Composite Industry Returns, Composite Size, and Composite BE/ME ratio. We then merge these into a new dataframe.

indreturnscomp <- aggregate(df$returns ~ df$CompInd + df$Year, df, mean)
colnames(indreturnscomp) <- c("CompInd", "Year", "Composite Returns")
aggcompret <- merge(df, indreturnscomp, all.x=T, by = c("Year", "CompInd"))

#composite industry size 
indsizecomp <- aggregate(df$size ~ df$CompInd + df$Year, df, mean)
colnames(indsizecomp) <- c("CompInd", "Year", "Composite Size")
#indsizecomp
aggcompsize <- merge(aggcompret, indsizecomp, all.x=T, by = c("Year", "CompInd"))

#composite BE/ME ratio
indbemecomp <- aggregate(df$BEMERatio ~ df$CompInd + df$Year, df, mean)
colnames(indbemecomp) <- c("CompInd", "Year", "Composite BE/ME Ratio")
aggcompbeme <- merge(aggcompsize, indbemecomp, all.x=T, by = c("Year", "CompInd"))

#composite beta and final dataframe
indbetacomp <- aggregate(df$Beta ~ df$CompInd + df$Year, df, mean)
colnames(indbetacomp) <- c("CompInd", "Year", "Composite Beta")
aggcompfin <- merge(aggcompbeme, indbetacomp, all.x=T, by = c("Year", "CompInd"))

#final dataframe
aggcompfin$Year <- as.numeric(aggcompfin$Year) #Year was a factor, so this yields level indices 1, 2, ... rather than calendar years; the loop below indexes on these

Graphics

First, we examine a boxplot of log firm size by composite industry to see whether there are noticeable differences between the composite industries. We would need a t-test of some sort to examine whether any differences are statistically significant. We add 1 inside the log because some of the firm sizes are zero, which would otherwise produce -Inf when taking the natural log.

boxplot(log(df$size + 1)~df$CompInd, main = "Boxplot of Composite Industry and Size", col = c("red", "blue", "green", "yellow","orange", "magenta", "turquoise", "lime green"), ylab = "Log of Size", xlab = 'Composite Industry', las = 2, cex.axis = .5) 

Overall, the industries appear relatively similar, with no observable skew within any industry. All industries also appear approximately normally distributed, which we will confirm later with normal quantile plots of each composite industry.

We first examine the average returns per composite industry by year from 1926-2014.

aggcompfin$CompInd <- as.factor(aggcompfin$CompInd)
plot(aggcompfin$`Composite Returns` ~ aggcompfin$Year, main = "Scatterplot of Year versus Return 1926-2014", pch = 24, col = aggcompfin$CompInd, xlab = "Year", ylab = "Yearly Returns")
legend("topright", legend= levels(aggcompfin$CompInd), pch=16, cex = 0.6, col = c(1:8))


Overall, returns seem relatively flat throughout the entire period. Given the number of composite industries, it is difficult to see any single trend in the graph. As such, we will have to run regression analyses by industry to verify if there are trends for each.

Next, we examine the Book Equity / Market Equity ratio for the eight composite industries over the sample period. Changes in this financial ratio give us an idea of the market valuations within each industry over time, and let us look for valuation bubbles through the years.

plot(aggcompfin$`Composite BE/ME Ratio` ~ aggcompfin$Year, main = "BE/ME Ratio for Each Industry by Year", xlab = "Year", ylab = "Average BE/ME", col = factor(aggcompfin$CompInd))
legend("topright", col= c(1:8), legend= levels(aggcompfin$CompInd), pch=16, cex = 0.6)


It seems as though Healthcare was the lowest-valued industry relative to the assets on its balance sheet. Several industries, including Commodities, Financials, Consumer, and "Other", went through a spike in valuation around the late 1930s, probably reflecting the recovery of stock prices following the cataclysmic drop that bottomed out in 1932 during the Great Depression.

Now, we will examine graphs within specific composite industries to observe how our continuous variables change over time. We have selected two composite industries at random (TMT and Commodities) to limit the number of graphs we are making.

#Making a new dataframe that has one value per composite industry
single <- data.frame(Year = c(1:88), Composite_Returns_TMT = NA, Composite_Size_TMT = NA, Composite_BEME_Ratio_TMT = NA, Composite_Beta_TMT = NA, Composite_Returns_Com = NA, Composite_Size_Com = NA, Composite_BEME_Ratio_Com = NA, Composite_Beta_Com = NA)
for(i in 1:length(unique(aggcompfin$Year))){
  temp1 <- aggcompfin[aggcompfin$Year == i,]
  a <- tapply(temp1$`Composite Size`, temp1$CompInd, mean)
  b <- tapply(temp1$`Composite Returns`, temp1$CompInd, mean)
  c <- tapply(temp1$`Composite BE/ME Ratio`, temp1$CompInd, mean)
  d <- tapply(temp1$`Composite Beta`, temp1$CompInd, mean)
  #Index by level name rather than position: with eight alphabetical levels,
  #position 8 is Utilities, not TMT, so positional indexing picks the wrong industry
  single$Composite_Size_TMT[i] <- a["TMT"]
  single$Composite_Returns_TMT[i] <- b["TMT"]
  single$Composite_BEME_Ratio_TMT[i] <- c["TMT"]
  single$Composite_Beta_TMT[i] <- d["TMT"]
  single$Composite_Size_Com[i] <- a["Commodities"]
  single$Composite_Returns_Com[i] <- b["Commodities"]
  single$Composite_BEME_Ratio_Com[i] <- c["Commodities"]
  single$Composite_Beta_Com[i] <- d["Commodities"]
}

barplot(single$Composite_BEME_Ratio_TMT, main = "Barplot of TMT BE/ME Ratio from 1926-2014", col = "blue", ylab = "Composite BE/ME for TMT")

barplot(single$Composite_Returns_TMT, main = "Barplot of TMT Average Yearly Returns from 1926-2014", col = "green", ylab = "Composite Returns for TMT")

barplot(single$Composite_Size_TMT, main = "Barplot of TMT Composite Size from 1926-2014", col = "green", ylab = "Composite Average Size for TMT")

We conclude from the three graphs that 1) TMT company valuations spiked in the 1940s, 2) average yearly returns in the TMT industry were net positive over the sample period, and 3) the market value of TMT companies decreased during the Great Depression and subsequently grew from the 1940s to the 1970s.

Now, we examine the same graphs for the Commodities industry.

barplot(single$Composite_BEME_Ratio_Com, main = "Barplot of Commodities BE/ME Ratio from 1926-2014", col = "green", ylab = "Composite BE/ME for Commodities")

barplot(single$Composite_Returns_Com, main = "Barplot of Commodities Composite Returns from 1926-2014", col = "green", ylab = "Composite Average Yearly Returns for Commodities")

barplot(single$Composite_Size_Com, main = "Barplot of Commodities Size from 1926-2014", col = "green", ylab = "Composite Size for Commodities")

We conclude from the three graphs that 1) Commodities company valuations spiked in the 1930s before the worst of the Great Depression, 2) average yearly returns in the Commodities industry were net positive over the sample period, and 3) the market value of Commodities companies grew consistently from the 1940s to the 1970s.

We have also elected to examine some of the histograms for our composite variables to get a general sense of the distribution and see if any transformations may be necessary.

hist(aggcompfin$`Composite Returns`, main = "Histogram of Composite Returns", xlab = "Average Yearly Composite Returns", ylab = "Frequency", col = "blue", xlim = c(-2,2))

hist(aggcompfin$`Composite BE/ME Ratio`, main = "Histogram of Composite BE/ME Ratio", xlab = "Composite BE/ME Ratio", ylab = "Frequency", col = "orange", xlim = c(0,7))

hist(aggcompfin$`Composite Beta`, main = "Histogram of Composite Beta", xlab = "Composite Beta", ylab = "Frequency", col = "green", xlim = c(0.1,1))

hist(aggcompfin$`Composite Size`, main = "Histogram of Composite Size", xlab = "Composite Average Size Yearly", ylab = "Frequency", col = "red")

We have created histograms of the four composite variables in our dataset. The histogram of composite returns appears relatively normally distributed. By contrast, the histograms of composite BE/ME ratio and composite size are right-skewed, whereas the histogram of composite beta is more left-skewed.

Finally, we examine normal quantile plots of our composite variables, grouped by composite industry, to see whether the data are normally distributed within each group.

qqPlot(aggcompfin$`Composite Returns`~aggcompfin$CompInd)

qqPlot(aggcompfin$`Composite Size`~aggcompfin$CompInd)

qqPlot(aggcompfin$`Composite BE/ME Ratio` ~ aggcompfin$CompInd)


Basic Tests

Because our data does not include any two-level categorical variables, we find it more useful to examine the correlations between the continuous composite variables plotted above. However, since the "single" dataframe contains only two industries, we can use basic t-tests to compare the four composite variables between them.

#Basic boxplots to get an inclination of the differences 
boxplot(single$Composite_Returns_TMT, single$Composite_Returns_Com, ylab = "Composite Returns", main = "Average Yearly Composite Returns (TMT vs Commodities)", col = "green", names = c("TMT", "Commodities"))

boxplot(single$Composite_Size_TMT, single$Composite_Size_Com, ylab = "Composite Size", main = "Average Yearly Composite Size (TMT vs Commodities)", col = "red", names = c("TMT", "Commodities"))

boxplot(single$Composite_BEME_Ratio_TMT, single$Composite_BEME_Ratio_Com, ylab = "Composite BE/ME Ratio", main = "Composite BE/ME Ratio (TMT vs Commodities)", col = "blue", names = c("TMT", "Commodities"))

#boxplot(single$Composite_Beta_TMT, single$Composite_Beta_Com, ylab = "Composite Beta", main = "Composite Beta (TMT vs Commodities)", col = "orange", names = c("TMT", "Commodities"))

t.test(single$Composite_Returns_TMT, single$Composite_Returns_Com)
## 
##  Welch Two Sample t-test
## 
## data:  single$Composite_Returns_TMT and single$Composite_Returns_Com
## t = -0.35297, df = 172.05, p-value = 0.7245
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.08092492  0.05637315
## sample estimates:
## mean of x mean of y 
## 0.1140227 0.1262986
t.test(single$Composite_Size_TMT, single$Composite_Size_Com)
## 
##  Welch Two Sample t-test
## 
## data:  single$Composite_Size_TMT and single$Composite_Size_Com
## t = 2.0075, df = 148.27, p-value = 0.04651
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##    6.840586 867.186759
## sample estimates:
## mean of x mean of y 
## 1140.5679  703.5542
t.test(single$Composite_BEME_Ratio_TMT, single$Composite_BEME_Ratio_Com)
## 
##  Welch Two Sample t-test
## 
## data:  single$Composite_BEME_Ratio_TMT and single$Composite_BEME_Ratio_Com
## t = -2.3542, df = 110.18, p-value = 0.02033
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.49543708 -0.04256292
## sample estimates:
## mean of x mean of y 
## 0.9694318 1.2384318
# t.test(single$Composite_Beta_TMT, single$Composite_Beta_Com) - Does not work because data is essentially constant. Based on the boxplot, the Betas do appear to be different between the industries. 

Based on the t-test, composite returns for TMT and Commodities are not statistically significantly different at any reasonable level of alpha. By contrast, size differs significantly at alpha = 0.05 (p = 0.04651); based on the boxplot, TMT firms are larger than Commodities firms on average. Finally, the BE/ME ratio differs significantly at the 0.05 level (p = 0.02033); based on the boxplot, the BE/ME ratio for Commodities is larger than for TMT. This makes intuitive sense, because tech stocks are typically valued at higher multiples of their revenue since more future growth is priced in.

#Now we check for significant correlations at the 95% level and plot them. 
sigcorr <- cor.mtest(aggcompfin[,c(9, 11:14)], conf.level = .95)
corrplot.mixed(cor(aggcompfin[,c(9, 11:14)]), lower.col="black", upper = "ellipse", tl.col = "black", number.cex=.7, 
               order = "hclust", tl.pos = "lt", tl.cex=.7, p.mat = sigcorr$p, sig.level = .05)

Overall, the correlations between variables are relatively low. The highest is between market return and composite returns. Composite beta and composite BE/ME ratio each have a low negative correlation with composite size, but those correlations are statistically significant at the 95% level. The remaining correlations are too low to support much interpretation. We will proceed by analyzing these correlations through bootstraps and permutation tests.

Briefly, let’s look at non-linearity, correlations, and histograms all at once for our variables of interest.

chart.Correlation(aggcompfin[,c(9, 11:14)], histogram=TRUE, pch=19)

The graph does not provide much additional information, although it does show that the relationship between composite returns and market return is quite linear, which is consistent with their strong correlation.

Now, we construct a bootstrapped confidence interval for the correlation between market return and composite industry return.
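A sketch of one way to build such an interval, using the basic percentile bootstrap. It is shown on simulated stand-ins for the two return series so the snippet runs on its own; with the real data, x and y would be aggcompfin$mktret and aggcompfin$`Composite Returns`.

```r
# Percentile bootstrap for a correlation, on simulated stand-in data.
set.seed(2024)
n <- 400
x <- rnorm(n)             # stand-in for market return
y <- 0.6 * x + rnorm(n)   # stand-in for composite industry return
n_boot <- 10000
boot_cors <- replicate(n_boot, {
  idx <- sample(n, replace = TRUE)   # resample rows with replacement
  cor(x[idx], y[idx])
})
ci <- quantile(boot_cors, c(0.025, 0.975))  # 95% percentile interval
```

If the interval excludes 0, the correlation is significant at roughly the 5% level without any normality assumption.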

Permutation Test

We now conduct a permutation test between composite beta and composite size in order to show a non-zero, statistically significant correlation between these two variables. A permutation test is appropriate here because it makes no distributional assumptions about either variable.

myCor <- function(x,y){
  plot(x,y,pch=19, col="red", xlab = "Composite Beta", ylab = "Composite Size")
  mtext(paste("Sample Correlation =", round(cor(x,y),3)), cex=1.2)
}

permCor <- function(x, y, n_samp = 10000, plot = T){
   corResults <- rep(NA, n_samp)
   for (i in 1:n_samp){
      corResults[i] <- cor(x, sample(y))
   }
   pval <- mean(abs(corResults) >= abs(cor(x,y)))
   if (plot == T){
      #Make histogram of permuted correlations
      hist(corResults, col = "yellow", main = "", xlab = "Correlations", breaks = 50,
           xlim = range(corResults,cor(x,y)))
      mtext("Permuted Sample Correlations", cex = 1.2, line = 1)
      mtext(paste("Permuted P-value =",round(pval,5)), cex = 1, line = 0)
      abline(v = cor(x,y), col="blue", lwd=3)
      text(cor(x,y)*.97, 0,paste("Actual Correlation =", round(cor(x,y),2)),srt = 90, adj = 0)
   }
   if (plot == F){
      return(round(pval,5))
   }  
}

myCor(aggcompfin$`Composite Beta`, aggcompfin$`Composite Size`)

permCor(aggcompfin$`Composite Beta`, aggcompfin$`Composite Size`, n_samp = 10000, plot = T)

Based on the permuted sample correlations, the actual correlation between composite beta and composite size appears to be significant at any reasonable level of alpha.

Regression Testing CAPM

As mentioned in the introduction, the CAPM says that the excess returns of a portfolio (returns above the risk-free rate) depend solely on the amount of market risk associated with that portfolio. Market risk is estimated by the portfolio’s beta with the market, as we have already done by taking the covariance of each portfolio’s returns with the market return, scaled by the variance of the market return. To test the CAPM, we use the model \(R_{i} = \gamma _{0} + \gamma _{1}\beta _{i}\) to represent a single cross section of returns for all industries \(i\) in our data set. We run this regression for each cross section (year) and then estimate \(\gamma _{0}\) and \(\gamma _{1}\) as the means across the cross-sectional regressions. We can then test the CAPM by testing whether the means of \(\gamma _{0}\) and \(\gamma _{1}\) are equal to 0 using t-tests. If returns are fully explained by market risk, \(\gamma _{0}\) should be approximately 0 (technically it should equal the risk-free rate, but since that is so close to 0 we approximate it as such for ease of calculation); if a portfolio is compensated for reasons other than market risk, \(\gamma _{0}\) will capture those other sources of return by being larger than 0. Therefore, we reject the CAPM if \(\gamma _{0}\) is statistically significantly different from 0. We also test whether \(\gamma _{1}\) is significantly different from 0, because if market risk predicts returns then its coefficient should be.

intercepts <- c()
coefficients <- c()
for (i in 1:length(unique(df$Year))){
  year <- unique(df$Year)[i]
  temp_df <- na.omit(data.frame(df$returns[df$Year == year], df$Beta[df$Year == year]))
  names(temp_df) <- c('returns', 'beta')
  fit <- lm(temp_df$returns ~ temp_df$beta) #Fit the cross-sectional regression once per year
  intercepts <- append(intercepts, as.double(fit$coefficients[1]))
  coefficients <- append(coefficients, as.double(fit$coefficients[2]))
}

hist(coefficients, main = 'Distribution of Gamma_1', xlab = 'Gamma_1')

hist(intercepts, main = 'Distribution of Gamma_0', xlab = 'Gamma_0')

means <- c(mean(intercepts), mean(coefficients))
sds <- c(sd(intercepts), sd(coefficients))
tstats <- c(mean(intercepts) / (sd(intercepts)/sqrt(length(intercepts))), mean(coefficients) / (sd(coefficients)/sqrt(length(coefficients))))
results_table <- data.frame(c('Gamma_0', 'Gamma_1'), means, sds, tstats)
names(results_table) <- c('Variable', 'Mean', 'SD', 't-stat')



print('The following table contains the resulting means, standard deviations, and t-statistics for Gamma_0 and Gamma_1')
## [1] "The following table contains the resulting means, standard deviations, and t-statistics for Gamma_0 and Gamma_1"
results_table
##   Variable       Mean        SD    t-stat
## 1  Gamma_0 0.10856513 0.2035720 5.0028070
## 2  Gamma_1 0.02577306 0.2777121 0.8705876

From the table, we can see that our test clearly rejects the CAPM. At a significance level of alpha = 0.05, the critical value for a two-sided test is approximately 2, so a t-stat of 5 for \(\gamma _{0}\) means we reject the null that it equals 0, and therefore we reject that the CAPM fully explains the cross-sectional returns. Similarly, the t-stat for \(\gamma _{1}\) is well below the critical value of 2 (and even below the one-sided critical value of 1.645), so we fail to reject the null that \(\gamma _{1}\) equals 0. Again, this is evidence against the CAPM.

Now, we will attempt to find the best model to predict returns given the data we have. This means we will use the industries, composite industries, BE/ME, and size measures in addition to beta and market returns. Some important caveats: the size and book-equity-to-market-equity measures are lagged by one year because that data would not have been available at the time of prediction. Also, when calculating beta we used the entire sample of returns, which we did to get the most accurate measure of each industry’s beta; that inevitably includes forward-looking data, so in practice the predictive power of beta will likely be lower out of sample. This time we run a time-series regression (as opposed to cross-sectional), meaning we won’t do a separate regression for each year; instead, our aim is to predict returns in any time period given the variables.

na_free_df <- na.omit(df)
#Exploratory full model with all predictors and interactions (overwritten by the reduced model below)
mod1 <- lm(na_free_df$returns ~ na_free_df$size + na_free_df$BEMERatio +  na_free_df$Beta + na_free_df$CompInd + na_free_df$Ind + na_free_df$size*na_free_df$Beta + na_free_df$size*na_free_df$BEMERatio + na_free_df$BEMERatio*na_free_df$Beta + na_free_df$CompInd*na_free_df$size)
#Reduced model: drop the individual-industry dummies and the CompInd:size interaction
mod1 <- lm(na_free_df$returns ~ na_free_df$size + na_free_df$CompInd + na_free_df$size*na_free_df$Beta + na_free_df$size*na_free_df$BEMERatio + na_free_df$BEMERatio*na_free_df$Beta)


mod2 <- lm(na_free_df$returns ~ na_free_df$BEMERatio + na_free_df$CompInd + na_free_df$mktret + na_free_df$RF)

summary(mod1)
## 
## Call:
## lm(formula = na_free_df$returns ~ na_free_df$size + na_free_df$CompInd + 
##     na_free_df$size * na_free_df$Beta + na_free_df$size * na_free_df$BEMERatio + 
##     na_free_df$BEMERatio * na_free_df$Beta)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.54247 -0.17666 -0.00017  0.16546  2.26295 
## 
## Coefficients:
##                                        Estimate Std. Error t value
## (Intercept)                           7.197e-02  4.165e-02   1.728
## na_free_df$size                       1.781e-06  4.571e-06   0.390
## na_free_df$CompIndConsumer            4.695e-02  1.995e-02   2.353
## na_free_df$CompIndFinancials          1.841e-02  2.262e-02   0.814
## na_free_df$CompIndHealthcare          6.235e-02  2.362e-02   2.640
## na_free_df$CompIndIndustrials         4.363e-02  2.012e-02   2.169
## na_free_df$CompIndOther               9.601e-03  2.037e-02   0.471
## na_free_df$CompIndTMT                 4.836e-02  2.181e-02   2.218
## na_free_df$CompIndUtilities           7.127e-03  3.603e-02   0.198
## na_free_df$Beta                      -1.949e-02  3.393e-02  -0.574
## na_free_df$BEMERatio                  8.048e-03  2.747e-02   0.293
## na_free_df$size:na_free_df$Beta      -4.253e-06  4.810e-06  -0.884
## na_free_df$size:na_free_df$BEMERatio  8.019e-06  9.806e-06   0.818
## na_free_df$Beta:na_free_df$BEMERatio  4.921e-02  2.363e-02   2.083
##                                      Pr(>|t|)   
## (Intercept)                           0.08412 . 
## na_free_df$size                       0.69682   
## na_free_df$CompIndConsumer            0.01867 * 
## na_free_df$CompIndFinancials          0.41581   
## na_free_df$CompIndHealthcare          0.00833 **
## na_free_df$CompIndIndustrials         0.03018 * 
## na_free_df$CompIndOther               0.63734   
## na_free_df$CompIndTMT                 0.02665 * 
## na_free_df$CompIndUtilities           0.84320   
## na_free_df$Beta                       0.56577   
## na_free_df$BEMERatio                  0.76956   
## na_free_df$size:na_free_df$Beta       0.37665   
## na_free_df$size:na_free_df$BEMERatio  0.41355   
## na_free_df$Beta:na_free_df$BEMERatio  0.03734 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2933 on 3506 degrees of freedom
## Multiple R-squared:  0.03411,    Adjusted R-squared:  0.03053 
## F-statistic: 9.525 on 13 and 3506 DF,  p-value: < 2.2e-16

ANOVA, Logit, Multinomial, or Webscraping

For this portion of the project, we will use ANOVA to examine differences in composite size across composite industries. We are curious about the pairwise differences between composite industries with regard to the composite size variable, so our ANOVA will focus on composite size.

Earlier, we examined the boxplot of composite industry by size and the boxplots appeared relatively even. We start by examining the sample standard deviations across composite industries to see if we satisfy our assumptions for ANOVA (i.e., the ratio of the maximum to minimum standard deviation is less than 2).

print("SD by Composite Industry")
## [1] "SD by Composite Industry"
(sds <- tapply(aggcompfin$`Composite Size`, aggcompfin$CompInd, sd))
## Commodities    Consumer  Financials  Healthcare Industrials       Other 
##   1097.8887   3362.6497   1055.5974    755.9716    922.8877   1250.9710 
##         TMT   Utilities 
##   1253.0774   1718.6425
print("Ratio of Max/Min Sample SD")
## [1] "Ratio of Max/Min Sample SD"
round(max(sds)/min(sds),1)
## [1] 4.4

The ratio is 4.4, so our assumption for ANOVA is not satisfied. We will continue with the ANOVA analysis, but we will also use a non-parametric test of differences in location across composite industries. Specifically, we believe it best to use the Kruskal-Wallis test, as it assumes neither normality within each group nor equal variances (a max/min SD ratio less than 2). We examined our normal quantile plots before and saw that the data was relatively normally distributed, but not for all composite industries; thus, Welch's ANOVA is not a solid choice in this scenario.

We begin with ANOVA:

aov1 <- aov(aggcompfin$`Composite Size` ~ aggcompfin$CompInd)
summary(aov1)
##                      Df    Sum Sq   Mean Sq F value Pr(>F)    
## aggcompfin$CompInd    7 9.796e+08 139942538   45.67 <2e-16 ***
## Residuals          4304 1.319e+10   3063949                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#We use Tukey comparisons to see differences in the mean size between composite industries
TukeyHSD(aov1)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = aggcompfin$`Composite Size` ~ aggcompfin$CompInd)
## 
## $`aggcompfin$CompInd`
##                                 diff         lwr        upr     p adj
## Consumer-Commodities     1206.271627   890.66903  1521.8742 0.0000000
## Financials-Commodities    -82.553203  -462.12074   297.0143 0.9979535
## Healthcare-Commodities    -46.244033  -404.10374   311.6157 0.9999346
## Industrials-Commodities   -98.808695  -408.72429   211.1069 0.9789907
## Other-Commodities          -2.411807  -333.72552   328.9019 1.0000000
## TMT-Commodities           186.756253  -135.81413   509.3266 0.6503252
## Utilities-Commodities     437.013672  -182.81753  1056.8449 0.3904411
## Financials-Consumer     -1288.824830 -1628.84387  -948.8058 0.0000000
## Healthcare-Consumer     -1252.515660 -1568.11826  -936.9131 0.0000000
## Industrials-Consumer    -1305.080322 -1565.05937 -1045.1013 0.0000000
## Other-Consumer          -1208.683435 -1493.83288  -923.5340 0.0000000
## TMT-Consumer            -1019.515374 -1294.45733  -744.5734 0.0000000
## Utilities-Consumer       -769.257955 -1365.69081  -172.8251 0.0023703
## Healthcare-Financials      36.309170  -343.25837   415.8767 0.9999916
## Industrials-Financials    -16.255492  -351.00260   318.4916 0.9999999
## Other-Financials           80.141395  -274.50875   434.7915 0.9973899
## TMT-Financials            269.309456   -77.18672   615.8056 0.2631882
## Utilities-Financials      519.566875  -113.04569  1152.1794 0.1995995
## Industrials-Healthcare    -52.564662  -362.48026   257.3509 0.9995946
## Other-Healthcare           43.832225  -287.48149   375.1459 0.9999233
## TMT-Healthcare            233.000286   -89.57010   555.5707 0.3576765
## Utilities-Healthcare      483.257705  -136.57350  1103.0889 0.2593958
## Other-Industrials          96.396887  -182.44515   375.2389 0.9668972
## TMT-Industrials           285.564948    17.17016   553.9597 0.0276638
## Utilities-Industrials     535.822367   -57.62083  1129.2656 0.1116207
## TMT-Other                 189.168060  -103.67466   482.0108 0.5101712
## Utilities-Other           439.425480  -165.46783  1044.3188 0.3500421
## Utilities-TMT             250.257419  -349.89156   850.4064 0.9118438
par(mar=c(5,11,4,1))
plot(TukeyHSD(aov1), las=1)

#Finally, we examine our residuals for the ANOVA model
myResPlots2(aov1, label = "Size Composite Industry")

Analysis: Our ANOVA model shows that mean composite size is statistically significantly different between composite industries, given that the p-value is smaller than any reasonable value of alpha. The degrees of freedom reported by the test are what we expect: k - 1 (where k is the number of groups) for composite industries, and the number of observations minus the number of groups for the residuals. Examining the normal quantile plot, we see that our errors are not normally distributed, and the fits vs. studentized residuals plot shows evidence of heteroskedasticity. We should therefore take the results of this test with a grain of salt because the assumptions for ANOVA were not sufficiently satisfied.
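As a quick arithmetic check on those degrees of freedom, using only the group count and residual degrees of freedom reported in the ANOVA table above:

```r
# Degrees-of-freedom check for the one-way ANOVA: k - 1 for the factor,
# n - k for the residuals.
k <- 8               # composite industries (groups)
n <- 4304 + k        # observations = residual df + number of groups
c(factor_df = k - 1, residual_df = n - k)  # 7 and 4304, matching the table
```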

Looking at the Tukey Comparisons, we can see that there are many pairs of industries that have statistically significantly different means. These pairs include (Consumer-Commodities), (Financials-Consumer), (Industrials-Consumer), (Healthcare-Consumer), (Other-Consumer), (TMT-Consumer), (Utilities-Consumer), and (TMT-Industrials).
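Rather than reading the significant pairs off the table by eye, they can be pulled programmatically from the TukeyHSD output. A minimal sketch on simulated data (not the project data; the same indexing applies to the `aggcompfin` fit above):

```r
# Extracting significantly different pairs from a TukeyHSD table.
set.seed(4)
g <- factor(rep(c("A", "B", "C"), each = 100))
y <- rnorm(300, mean = c(0, 0, 1)[as.numeric(g)])  # group C is shifted up
tk <- TukeyHSD(aov(y ~ g))$g
rownames(tk)[tk[, "p adj"] < 0.05]   # pairs whose adjusted p-value is < 0.05
```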

#Now, let's see what transformation is suggested
trans1 <- boxCox(aov1)

trans1$x[which.max(trans1$y)]
## [1] -0.1414141
#The suggested lambda is approximately -0.14, which is close to 0, so a log transformation is appropriate. Let's see what happens to our ANOVA model if we use it. The transformation makes sense given that size is on a dollar scale.
transsize <- log(aggcompfin$`Composite Size`)
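The connection between the estimated lambda and the log is worth making explicit: the Box-Cox family is \((y^\lambda - 1)/\lambda\) for \(\lambda \neq 0\) and \(\log(y)\) at \(\lambda = 0\), so a lambda estimate of about -0.14 is close enough to 0 to justify the simpler log. A minimal sketch on simulated right-skewed data (not the project data):

```r
# Box-Cox family: (y^lambda - 1)/lambda for lambda != 0, log(y) at lambda = 0.
bc <- function(y, lambda) if (lambda == 0) log(y) else (y^lambda - 1)/lambda
set.seed(1)
y <- rlnorm(1000, meanlog = 6, sdlog = 1.5)  # lognormal, like dollar-scale sizes
# As lambda -> 0, the transform converges to the log:
max(abs(bc(y, 1e-8) - log(y)))  # effectively zero
```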

print("SD by Composite Industry")
## [1] "SD by Composite Industry"
(sds <- tapply(transsize, aggcompfin$CompInd, sd))
## Commodities    Consumer  Financials  Healthcare Industrials       Other 
##    1.403026    2.056953    1.553958    1.010922    1.481583    1.544122 
##         TMT   Utilities 
##    1.205141    1.396993
print("Ratio of Max/Min Sample SD")
## [1] "Ratio of Max/Min Sample SD"
round(max(sds)/min(sds),1)
## [1] 2
#Now the ratio is less than 2 and our assumption for ANOVA is satisfied. We will proceed with the transformed model
aov2 <- aov(transsize ~ aggcompfin$CompInd)
summary(aov2)
##                      Df Sum Sq Mean Sq F value Pr(>F)    
## aggcompfin$CompInd    7    403   57.54   24.53 <2e-16 ***
## Residuals          4304  10094    2.35                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Analysis: Our second model, in which we took the log of size, still shows statistically significant differences between composite industries, given that the p-value is lower than any reasonable alpha.

#An alternative here is pairwise.t.test(transsize, aggcompfin$CompInd);
#we instead use Tukey comparisons on the transformed model
TukeyHSD(aov2)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = transsize ~ aggcompfin$CompInd)
## 
## $`aggcompfin$CompInd`
##                                diff         lwr          upr     p adj
## Consumer-Commodities     0.16673814 -0.10937741  0.442853691 0.5986769
## Financials-Commodities  -0.38125154 -0.71332897 -0.049174107 0.0118221
## Healthcare-Commodities   0.37720292  0.06411732  0.690288528 0.0063752
## Industrials-Commodities -0.25270631 -0.52384640  0.018433780 0.0888323
## Other-Commodities       -0.29615796 -0.58601891 -0.006297015 0.0411191
## TMT-Commodities          0.44295221  0.16074066  0.725163763 0.0000547
## Utilities-Commodities    0.49616512 -0.04611506  1.038445294 0.1016809
## Financials-Consumer     -0.54798968 -0.84546677 -0.250512586 0.0000007
## Healthcare-Consumer      0.21046478 -0.06565077  0.486580332 0.2874032
## Industrials-Consumer    -0.41944445 -0.64689587 -0.191993030 0.0000007
## Other-Consumer          -0.46289610 -0.71236868 -0.213423521 0.0000006
## TMT-Consumer             0.27621407  0.03567185  0.516756293 0.0117949
## Utilities-Consumer       0.32942698 -0.19238236  0.851236319 0.5409444
## Healthcare-Financials    0.75845446  0.42637703  1.090531894 0.0000000
## Industrials-Financials   0.12854523 -0.16431954  0.421409998 0.8871891
## Other-Financials         0.08509358 -0.22518403  0.395371183 0.9913393
## TMT-Financials           0.82420375  0.52105992  1.127347585 0.0000000
## Utilities-Financials     0.87741666  0.32395427  1.430879044 0.0000433
## Industrials-Healthcare  -0.62990923 -0.90104932 -0.358769143 0.0000000
## Other-Healthcare        -0.67336088 -0.96322183 -0.383499938 0.0000000
## TMT-Healthcare           0.06574929 -0.21646226  0.347960841 0.9968348
## Utilities-Healthcare     0.11896220 -0.42331798  0.661242371 0.9978379
## Other-Industrials       -0.04345165 -0.28740599  0.200502685 0.9994401
## TMT-Industrials          0.69565852  0.46084432  0.930472724 0.0000000
## Utilities-Industrials    0.74887143  0.22967769  1.268065167 0.0003334
## TMT-Other                0.73911017  0.48290687  0.995313475 0.0000000
## Utilities-Other          0.79232308  0.26311182  1.321534343 0.0001560
## Utilities-TMT            0.05321291 -0.47184762  0.578273428 0.9999875
par(mar=c(5,11,4,1))
plot(TukeyHSD(aov2), las=1)

myResPlots2(aov2, label = "Size Composite Industry")

Analysis: In the transformed model, TMT, Healthcare, and Utilities tend to have the largest mean log sizes: TMT is statistically significantly larger than Commodities, Consumer, Financials, Industrials, and Other; Healthcare is significantly larger than Commodities, Financials, Industrials, and Other; and Utilities is significantly larger than Financials, Industrials, and Other. At the other end, Financials and Other are significantly smaller than both Commodities and Consumer, and Industrials is significantly smaller than Consumer. The errors of our new ANOVA model still do not appear normally distributed, but the fits vs. residuals plot shows less evidence of heteroskedasticity. This is an improvement over the prior model.

Before we continue with our non-parametric test, we will use Bartlett's and Levene's tests to see if the variances are equal.

bartlett.test(transsize, aggcompfin$CompInd)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  transsize and aggcompfin$CompInd
## Bartlett's K-squared = 364.93, df = 7, p-value < 2.2e-16

Analysis: Based on Bartlett's test, we reject the null hypothesis that the variances are homogeneous, a result which could be due to non-normality, given that Bartlett's test assumes the data are drawn from a normal distribution (and transsize, as seen above, is not normal).
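Bartlett's sensitivity to non-normality can be illustrated with a quick simulation (a sketch on simulated heavy-tailed data, not the project data): even when every group has exactly the same variance, Bartlett's test rejects far more often than the nominal 5% when the data are leptokurtic.

```r
# Simulated illustration: Bartlett's test under equal variances but
# heavy-tailed (t with 3 df) data.  All groups share the same distribution,
# so the null hypothesis of homogeneous variances is true by construction.
set.seed(2)
g <- factor(rep(1:4, each = 100))
pvals <- replicate(200, bartlett.test(rt(400, df = 3), g)$p.value)
mean(pvals < 0.05)  # rejection rate well above the nominal 0.05
```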

leveneTest(transsize, aggcompfin$CompInd)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value    Pr(>F)    
## group    7    60.6 < 2.2e-16 ***
##       4304                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Analysis: Once again, the F-value is very high and the p-value is much lower than any reasonable alpha, meaning we reject the null hypothesis that the variances are homogeneous. (Levene's test makes no assumptions about normality, which makes it more appropriate for analyzing transsize.)

Based on the results of Bartlett's test, we find it best to use the Kruskal-Wallis test as our non-parametric test, as it assumes neither normality nor equality of variances. Now, we proceed with it.

#Now, the Kruskal-Wallis test
kruskal.test(transsize ~ aggcompfin$CompInd)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  transsize by aggcompfin$CompInd
## Kruskal-Wallis chi-squared = 213.83, df = 7, p-value < 2.2e-16
#Comparing to One-Way ANOVA
summary.aov(aov(transsize ~ aggcompfin$CompInd))
##                      Df Sum Sq Mean Sq F value Pr(>F)    
## aggcompfin$CompInd    7    403   57.54   24.53 <2e-16 ***
## Residuals          4304  10094    2.35                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Analysis: The chi-squared value for the Kruskal-Wallis test is high and the p-value is small enough to reject the null hypothesis at any reasonable level of alpha, meaning that at least one group median differs from the others. As stated earlier, the Kruskal-Wallis test is a good choice here given that our original data is not normally distributed within each composite industry and the variances are not the same between groups (based on the results of Levene's and Bartlett's tests).
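The intuition behind this choice can be seen in a small sketch (simulated data, not the project data): the Kruskal-Wallis test is, in essence, a one-way ANOVA carried out on the ranks of the observations, which is why it is insensitive to non-normality in the raw values.

```r
# Kruskal-Wallis vs. one-way ANOVA on ranks, with groups shifted by construction.
set.seed(3)
g <- factor(rep(letters[1:3], each = 50))
x <- rnorm(150) + as.numeric(g)   # group medians differ by 1 between groups
kruskal.test(x ~ g)               # rejects: at least one group median differs
summary(aov(rank(x) ~ g))         # ANOVA on the ranks tells the same story
```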

Final Comments

At the beginning of this project we set out with two main goals: to explore the variability in returns and characteristics of different industries over time, and to test the CAPM. Regarding the first objective, we observed that returns were actually very similar for most industries in the long run. From our tests, we didn't see much statistically significant difference in the mean returns since 1926 of each industry. However, we did notice variability in some of the other metrics, such as size or BE/ME. For example, we noticed that TMT companies have a statistically significantly lower BE/ME ratio than commodities, meaning that the market gives them a higher valuation relative to their tangible assets. We also noticed that over time, certain companies, especially technology companies, tend to be larger in market value than firms in other sectors. We also rejected the CAPM by exploring whether market risk was able to explain the yearly cross section of industry returns. We found that there were unexplained sources of return in the model, meaning that either markets do not efficiently compensate investors for the risk they take or there are other sources of risk we did not include in the model (technically, one could also argue that our estimates for beta were wrong, but it is hard to imagine that a small amount of estimation error could cause a t-stat of 5). Either way, we showed with a high degree of statistical certainty that the CAPM does not hold in our data.